Optimizing Automatic Summarization of Long Clinical Records Using Dynamic Context Extension: Testing and Evaluation of the NBCE Method
Zhang, Guoqing, Fukuyama, Keita, Kishimoto, Kazumasa, Kuroda, Tomohiro
Summarizing patient clinical notes is vital for reducing documentation burdens, yet manual summarization leaves medical staff struggling. We propose an automatic method using LLMs, but long inputs cause LLMs to lose context, reducing output quality, especially in small models. We used a 7B model, open-calm-7b, enhanced with Naive Bayes Context Extension (NBCE) and a redesigned decoding mechanism that references one sentence at a time, keeping inputs within the 2048-token context window. On 200 samples, our improved model achieved near parity on ROUGE-L metrics with Google's Gemini (over 175B parameters), indicating strong performance with far fewer resources and enhancing the feasibility of automated EMR summarization.
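The dynamic context extension described above builds on Naive Bayes Context Extension, which scores each next token against every context chunk independently and pools the resulting distributions under a conditional-independence assumption. A minimal numpy sketch of that pooling rule (the function names and toy vocabulary are illustrative, and the paper's exact weighting may differ):

```python
import numpy as np

def logsumexp(a, axis=None, keepdims=False):
    """Numerically stable log-sum-exp (numpy-only, avoiding a scipy dependency)."""
    a = np.asarray(a, dtype=float)
    m = np.max(a, axis=axis, keepdims=True)
    s = np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True)) + m
    if keepdims:
        return s
    if axis is None:
        return s.item()
    return np.squeeze(s, axis=axis)

def nbce_pool(context_logits, prior_logits):
    """Pool next-token logits from several independent context chunks with the
    naive-Bayes rule:  log p(x|c_1..c_n) ∝ sum_i log p(x|c_i) - (n-1) log p(x).
    context_logits: (n_contexts, vocab); prior_logits: (vocab,) from the
    context-free prompt. Returns a normalized log-distribution over the vocab."""
    context_logits = np.asarray(context_logits, dtype=float)
    n = context_logits.shape[0]
    log_p_ctx = context_logits - logsumexp(context_logits, axis=-1, keepdims=True)
    log_p_prior = prior_logits - logsumexp(prior_logits)
    pooled = log_p_ctx.sum(axis=0) - (n - 1) * log_p_prior
    return pooled - logsumexp(pooled)  # renormalize

# Toy example: 3 sentence-contexts over a vocabulary of 5 tokens.
rng = np.random.default_rng(0)
ctx = rng.normal(size=(3, 5))
prior = rng.normal(size=5)
pooled = nbce_pool(ctx, prior)
```

In a real decoder, each row of `context_logits` would come from one forward pass over a single sentence plus the question, so no pass ever exceeds the 2048-token window.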
Design and evaluation of AI copilots -- case studies of retail copilot templates
Furmakiewicz, Michal, Liu, Chang, Taylor, Angus, Venger, Ilya
Building a successful AI copilot requires a systematic approach. This paper is divided into two sections, covering the design and evaluation of a copilot respectively. A case study of developing copilot templates for the retail domain by Microsoft is used to illustrate the role and importance of each aspect. The first section explores the key technical components of a copilot's architecture, including the LLM, plugins for knowledge retrieval and actions, orchestration, system prompts, and responsible AI guardrails. The second section discusses testing and evaluation as a principled way to promote desired outcomes and manage unintended consequences when using AI in a business context. We discuss how to measure and improve a copilot's quality and safety through the lens of an end-to-end human-AI decision loop framework. By providing insights into the anatomy of a copilot and the critical aspects of testing and evaluation, this paper offers concrete evidence that good design and evaluation practices are essential for building effective, human-centered AI assistants.
- North America > United States (0.04)
- Europe > Switzerland (0.04)
- Retail (1.00)
- Information Technology > Security & Privacy (1.00)
- Information Technology > Services (0.67)
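The architectural components the abstract lists (system prompt, plugins, orchestration, guardrails) can be sketched as a single request loop. Everything below is illustrative: `Plugin`, `check_guardrails`, and the stubbed LLM call are hypothetical names, not Microsoft's actual template API.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Plugin:
    name: str
    description: str
    run: Callable[[str], str]  # executes the plugin action on an argument string

SYSTEM_PROMPT = "You are a retail copilot. Use a plugin when the user asks about data."

def check_guardrails(text: str) -> bool:
    # Placeholder responsible-AI filter: block an (illustrative) denylist.
    return not any(bad in text.lower() for bad in ("credit card", "password"))

def orchestrate(user_msg: str, plugins: dict, call_llm) -> str:
    """One pass of the copilot loop: guardrail -> LLM decision -> plugin -> guardrail."""
    if not check_guardrails(user_msg):
        return "Request blocked by safety policy."
    # The LLM (stubbed here) decides whether a plugin call is needed.
    decision = call_llm(SYSTEM_PROMPT, user_msg, list(plugins))
    if decision.startswith("PLUGIN:"):
        name, _, arg = decision[len("PLUGIN:"):].partition(" ")
        answer = plugins[name].run(arg)
    else:
        answer = decision
    return answer if check_guardrails(answer) else "Response withheld."

# Toy usage with a stubbed LLM and one knowledge-retrieval plugin.
inventory = Plugin("inventory", "look up stock", lambda sku: f"{sku}: in stock")
stub_llm = lambda sys_prompt, msg, tools: "PLUGIN:inventory sku123"
result = orchestrate("Is sku123 in stock?", {"inventory": inventory}, stub_llm)
```

The point of the sketch is the control flow, not the stubs: guardrails wrap both the user input and the model output, and the orchestrator, not the plugin, owns the decision of what reaches the user.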
User-Centric Evaluation of ChatGPT Capability of Generating R Program Code
This paper reports an evaluation of ChatGPT's capability to generate R programming language code from natural language input. A dataset specially designed for R code generation was constructed with metadata to support scenario-based testing and evaluation across usage scenarios of varying difficulty and program type. The evaluation follows a multiple-attempt process in which the tester tries to complete the code generation task over a number of attempts until a satisfactory solution is obtained, or gives up after a fixed maximum number of attempts. In each attempt the tester formulates a natural language input to ChatGPT based on the previous results and the task to be completed. In addition to the metrics of average number of attempts and average time taken to complete the tasks, the final generated solutions are assessed on a number of quality attributes, including accuracy, completeness, conciseness, readability, well-structuredness, logic clarity, depth of explanation, and coverage of parameters. Our experiments demonstrate that ChatGPT is in general highly capable of generating high-quality R program code as well as textual explanations, although it may fail on hard programming tasks. The experiment data also show that human developers can hardly learn from experience to improve their skill at using ChatGPT to generate code.
- Research Report > Experimental Study (0.67)
- Research Report > New Finding (0.66)
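The multiple-attempt protocol above yields per-task records that are aggregated into the paper's headline metrics. A small sketch of that aggregation, where the `Trial` fields and sample values are assumptions for illustration, not the paper's data:

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class Trial:
    task_id: str
    attempts: int    # attempts used before success, or before giving up at the cap
    seconds: float   # wall-clock time spent on the task
    solved: bool     # did the tester reach a satisfactory solution?

def summarize(trials):
    """Compute the success rate plus the average-attempts and average-time metrics."""
    solved = [t for t in trials if t.solved]
    return {
        "success_rate": len(solved) / len(trials),
        "avg_attempts": mean(t.attempts for t in trials),
        "avg_seconds": mean(t.seconds for t in trials),
    }

# Hypothetical records: two solved tasks and one give-up at the attempt cap.
trials = [
    Trial("t1", attempts=1, seconds=30.0, solved=True),
    Trial("t2", attempts=5, seconds=200.0, solved=False),
    Trial("t3", attempts=2, seconds=60.0, solved=True),
]
stats = summarize(trials)
```

The quality attributes (accuracy, completeness, and so on) would be scored per final solution and averaged the same way.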
Structured Machine Learning for 'Soft' Classification with Smoothing Spline ANOVA and Stacked Tuning, Testing and Evaluation
We describe the use of smoothing spline analysis of variance (SS-ANOVA) in the penalized log likelihood context, for learning (estimating) the probability p of a '1' outcome, given a training set with attribute vectors and outcomes. The smoothing parameters governing f are obtained by an iterative unbiased risk or iterative GCV method. Confidence intervals for these estimates are available. In medical risk factor analysis, records of attribute vectors and outcomes (0 or 1) for each example (patient) for n examples are available as training data.
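For reference, the objective this abstract refers to has the standard penalized-likelihood form, assuming (as is conventional in the SS-ANOVA literature, though not stated in the abstract) that f is the logit of p and J is a roughness penalty:

```latex
% p is modeled through the logit f:
p(t) = \frac{e^{f(t)}}{1 + e^{f(t)}}
% and the estimate of f minimizes the penalized negative log likelihood
I_\lambda(f) = -\frac{1}{n} \sum_{i=1}^{n}
  \Bigl[ y_i f(t_i) - \log\bigl(1 + e^{f(t_i)}\bigr) \Bigr]
  + \lambda\, J(f)
```

The smoothing parameter λ (and its per-component analogues in the ANOVA decomposition of f) is what the iterative unbiased risk or GCV procedure selects.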
aishield.STRAITE - AI Security Solutions for Testing and Evaluation
At AIShield, we have had an impressive year of growth and achievement. Our team has consistently demonstrated adaptability and a focus on pivoting when necessary, allowing us to make significant progress in our product, business, and team. To drive this progress, we have implemented several strategic initiatives: an API-first product, targeting key industries, offering free product trials, hosting and launching our product on AWS, building demos, releasing a white paper, enabling free security assessments, deploying defenses across the multi-cloud-to-edge continuum, and providing reference implementations with a Python SDK. These efforts have helped us attract and serve many customers and have laid a strong foundation for our business moving forward. Our focus on AI security has enabled us to develop innovative technology that sets us apart from the competition.
How the DOD is developing its AI ethics guidance - FedScoop
It has been six months since the Department of Defense adopted ethical principles for artificial intelligence. Since then, the department's Joint AI Center has faced the daunting challenge of taking that conceptual work and scaling it to develop actionable guidance for the rest of the military. The goal is to give anyone who works in technology development -- from contracting officers to software developers -- a "shared vocabulary" for building ethics into any DOD work involving AI. What's at stake, leaders say, is ensuring that the DOD uses the emerging technology in ways that uphold the department's values while managing potentially huge shifts in the "character" of warfare. The first step is to agree on a document that turns the principles into clear guidance.
- Government > Regional Government > North America Government > United States Government (1.00)
- Government > Military (1.00)